OPEN 3 ACCESS Freely available online 

New Biomarkers of Coffee Consumption Identified by the 
Non-Targeted Metabolomic Profiling of Cohort Study 
Subjects 

Joseph A. Rothwell 1 ' 29 , Yoann Fillatre 1 ' 2 ' 9 , Jean-Francois Martin 1 ' 3 , Bernard Lyan 1 ' 3 , Estelle Pujos- 
Guillot 1 ' 3 , Leopold Fezeu 4 , Serge Hercberg 4 , Blandine Comte 1,2 , Pilar Galan 4 , Mathilde Touvier 4 , 
Claudine Manach 1 2 

1 INRA, UMR 1019, Human Nutrition Unit, CRNH Auvergne, Clermont-Ferrand, France, 2Clermont University, Human Nutrition Unit, Clermont-Ferrand, France, 3INRA, 
Plateforme d'Exploration du Metabolisme, Clermont-Ferrand, France, 4 Paris 13 University, Sorbonne Paris Cite, Nutritional Epidemiology Research Team, Epidemiology 
and biostatistics Research Center, INSERM U1153, INRA U1125, CNAM, Paris 5 University, Paris 7 University, Bobigny, France 



Abstract 

Coffee contains various bioactives implicated with human health and disease risk. To accurately assess the effects of overall 
consumption upon health and disease, individual intake must be measured in large epidemiological studies. Metabolomics 
has emerged as a powerful approach to discover biomarkers of intake for a large range of foods. Here we report the 
profiling of the urinary metabolome of cohort study subjects to search for new biomarkers of coffee intake. Using repeated 
24-hour dietary records and a food frequency questionnaire, 20 high coffee consumers (183-540 mL/d) and 19 low 
consumers were selected from the French SU.VI.MAX2 cohort. Morning spot urine samples from each subject were profiled 
by high-resolution mass spectrometry. Partial least-square discriminant analysis of multidimensional liquid chromatography- 
mass spectrometry data clearly distinguished high consumers from low via 132 significant (p-value<0.05) discriminating 
features. Ion clusters whose intensities were most elevated in the high consumers were annotated using online and in- 
house databases and their identities checked using commercial standards and MS-MS fragmentation. The best 
discriminants, and thus potential markers of coffee consumption, were the glucuronide of the diterpenoid atractyligenin, 
the diketopiperazine cyclo(isoleucyl-prolyl), and the alkaloid trigonelline. Some caffeine metabolites, such as 1- 
methylxanthine, were also among the discriminants, however caffeine may be consumed from other sources and its 
metabolism is subject to inter-individual variation. Receiver operating characteristics curve analysis showed that the 
biomarkers identified could be used effectively in combination for increased sensitivity and specificity. Once validated in 
other cohorts or intervention studies, these specific single or combined biomarkers will become a valuable alternative to 
assessment of coffee intake by dietary survey and finally lead to a better understanding of the health implications of coffee 
consumption. 
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Introduction 

Coffee is one of the most widely consumed beverages in the 
world. It is consumed on a daily basis in most of the United States, 
Canada, Japan, New Zealand, and in Europe, where consumption 
is greatest in Scandanavian countries [1]. Due to its rich 
phytochemistry and frequent consumption, the beverage has a 
complex relationship with human health, and may be responsible 
for both negative and positive health effects [2,3] . Coffee intake is 
known to increase blood pressure [4,5] and when consumed in 
excess may also elevate the risk of cardiovascular disease [6]. 
Consumption has also recently been associated with increased all- 
cause mortality [7]. In contrast, recent epidemiological studies 
have suggested that regular coffee consumption could decrease the 



risk of type II diabetes, Parkinsonism, Alzheimer's disease, liver 
cancer, and even the risk of stroke [2,8-10]. 

Coffee contains several bioactives of potential importance to 
human health. Firstly, it is the major dietary source of the alkaloid 
stimulant caffeine, long believed to influence vascular health. 
Secondly, the beverage is rich in phenolic acids, particularly 
isomers of caffeoylquinic acid, the greatest contributor to 
polyphenol intake in European populations [11,12]. Thirdly, 
coffee contains a range of other potential bioactives whose 
metabolites may influence human health at lower concentrations. 
For example, coffee diterpenoids cafestol and kahweol may be 
chemopreventive but also raise cholesterol levels in healthy 
humans [13]. Risks and benefits may vary between individuals, 
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depending on individual risk factors for the diseases, genetic 
variation affecting caffeine metabolism and pharmacodynamics, 
background diet, and method of coffee preparation. The accurate 
assessment of coffee consumption in observational studies, as with 
other foods of dietary importance, is problematic. In the large- 
scale epidemiological studies needed to characterize the links 
between diet and health, intake data are collected by dietary 
questionnaires, which rely on accurate self-reporting by partici- 
pants. Despite advances in data collection techniques, bias remain 
a problem, and burdensome and expensive protocols using 
multiple 24 h dietary recalls and/or validated thorough food 
frequency questionnaires (FFQs) are necessary to obtain accurate 
assessment of food intake. 

A potential alternative to these assessment methods is the use of 
biomarkers, which if well-validated could reliably reflect the recent 
and/or habitual consumption of a food or dietary pattern of 
interest [14]. Until recendy, these could only be discovered ad-hoc 
based on knowledge of the metabolism of certain food components 
[15]. Metabolome profiling now allows the more efficient 
discovery of biomarkers of intake. Biofluids from low and high 
consumers or intervention and control subjects are profiled and 
compared, and the signals responsible for the variation investigat- 
ed as potential biomarkers [16]. Several biomarkers have been 
proposed for coffee intake from intervention studies but none has 
been evaluated yet in a large cohort study. The aim of the present 
study was to search for reliable biomarkers of coffee intake by 
profiling the biobanked urines of SU.VI.MAX2 cohort subjects 
who had reported either high or low habitual consumption. The 
use of samples from cohort studies rather than controlled 
interventions have allowed us to propose new candidate biomark- 
ers that are robust and stable enough to be measured in any type 
of study, as recendy demonstrated for the intake of citrus fruit [16]. 

Materials and Methods 

Standards and reagents 

Standards of caffeine, 1-methyluric acid, trigonelline (1- 
methylpyridinium-3-carboxylate) hydrochloride, hippuric acid, 
theophylline and paraxanthine were purchased from Sigma- 
Aldrich (L'Isle d'Abeau, France). 1,7-dimethyluric acid and 5- 
acetylamino-6-formylamino-3-methyluracil (AFMU) were sup- 
plied by Toronto Research Chemicals Inc (Canada). CycloQeu- 
cyl-prolyl) was obtained from Bachem (USA) and 3-hydroxyhip- 
puric acid was kindly provided by P.C.H. Hollman (Rikilt, 
Wageningen Univ., The Netherlands). For urine hydrolysis, /?- 
Glucuronidase - sulfatase from Helix pomatia (G0786) was 
purchased from Sigma-Aldrich (France). HPLC-grade acetonitrile 
and formic acid were supplied by Sigma-Aldrich (France) and 
water used for liquid chromatography-mass spectrometry (LC-MS) 
was generated from a Milli-Q integral water purification system. 

Subjects 

Study subjects were participants in the Supplementation en 
Vltamines Mineraux et AntioXydants 2 (SU.VI.MAX 2) cohort. 
SU.VI.MAX and SU.VI.MAX 2 followed 13,000 and 6850 
participants respectively between 1994 and 2009 to investigate the 
effect of nutrition on the quality of aging [17]. The study was 
conducted according to the guidelines of the Declaration of 
Helsinki and approved by the Ethical Committee for Studies with 
Human Subjects of Paris-Cochin Hospital (CCPPRB N° 706 and 
2364, respectively) and the Comite National Informatique et Liberie (N° 
334641 and 907094, respectively). All participants signed an 
informed consent form approved by the Ethical Committee. Every 
two months, participants were invited to complete a 24 h dietary 



record via the Minitel Telematic Network, a French telephone- 
based terminal equivalent to an Internet prototype widely used at 
the beginning of the study. The records were obtained between 
weeks and weekends in a 2:1 ratio and evenly between seasons to 
take into account intra-individual variability. Participants assessed 
portion sizes using a validated picture booklet. In 2009, the 
SU.VI.MAX2 subjects also completed a validated food frequency 
questionnaire (FFQ) [18]. For the PhenoMeNEp study (Phenotyp- 
ing using Metabolomics for Nutritional Epidemiology), 144 high 
and 66 low consumers of fruit and vegetables (F&V), balanced for 
age, gender, BMI, and season of sampling, were randomly selected 
from the highest and lowest quartiles of F&V consumption as 
reported both in six detailed 24 h records and one FFQ obtained 
from the cohort between 1998 and 2009. In anticipation of 
validating the statistical models externally, 50 additional subjects 
were selected from all quartiles of F&V consumption. For the 
present study, whose aim was to identify biomarkers of coffee 
intake, subgroups of 19 low and 20 high coffee consumers were 
further selected from the 210 SU.VI.MAX2 subjects. These were 
all subjects who had reported either high (> 180 mL/d) or no 
coffee consumption consistently in the dietary records and in the 
FFQ. The selection was based on the distribution of coffee intake, 
taking into account the correlation of consumption of other foods 
such as chocolate or red wine, as shown in Supporting Information 
SI. 

Profiling of urine samples 

For each of the 260 PhenoMeNEp subjects, one morning spot 
urine from the SU.VI.MAX2 biobank (collected between 2007 
and 2009) was profiled using UPLC-QTof-MS. Urine samples 
were centrifuged at 12,000 g for 4 min and diluted two-fold with 
Milli-Q purified water. Samples were profiled in a randomized 
sequence using a Waters Acquity UPLC module (Waters, 
Manchester, UK) coupled to a Waters QToF-Micro mass 
spectrometer equipped with an electrospray source operated both 
in positive (ESI+) and negative (ESI—) ionization modes, and a 
lock-mass sprayer to ensure accuracy. Details of the analysis 
conditions have been previously published [16]. 

Data pre-processing and statistical analysis 

FuU-scan UPLC-QTof-MS data for the 260 subjects were 
extracted and processed using XCMS software [19]. To correct 
for drift between different series, ion intensities were normalized 
using data from quality controls consisting of pooled study urine 
samples [20]. The datasets obtained for positive and negative 
modes were merged and then filtered to remove any ions that did 
not appear in more than 25% of the samples of at least one group 
(low or high F&V consumers). For the present study, statistical 
analysis was then carried out on a reduced dataset comprising the 
data matrix of the 19 low- and 20 high coffee consumers. Data 
were log transformed and an orthogonal signal correction (OSC) 
[21] filter with Pareto scaling was applied to all remaining ions to 
reduce variability not associated with the diet effect. Both 
univariate and multivariate statistical approaches were used. On 
the OSC filtered dataset, a one-way ANOVA (R software) was 
performed on each ion to search for differences in intensities 
between the selected low and high coffee consumers, applying a 
Benjamini-Hochberg (BH) [22] p-value correction to reduce the 
risk of false positive discovery. All ions with p-value<0.05 were 
considered statistically significant. In parallel, a partial least 
squares discriminant analysis (PLS-DA) was performed on all 
detected ions (SIMCA-P+ software, version 13.0, Umetrics AB, 
Umea, Sweden). The predictive ability of the PLS model was 
assessed by cross validation (Q 2 cum) and permutation test 
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(n= 100; plot and CV ANOVA). Variable Importance Projection 
(VIP) values were obtained as indicators of importance of each ion 
in the discrimination. 

Receiver Operating Characteristics (ROC) curves, widely 
considered to be the most objective and statistically valid method 
for the evaluation of biomarker performance, were constructed 
using the ROCCET web-based tool [23]. As with the previous 
statistical analysis, data used for ROC analyses were those filtered 
with OSC. The 20 high and 19 low-coffee consumers selected for 
the study were used as the training set and 10 high and 10 low- 
coffee consumers were selected among the validation population as 
the hold-out set. Both individual and multiple biomarker ROC 
curves were built using the support vector machines algorithm. 
The average of predicted class probabilities of each sample and the 
average predicted accuracy were then calculated across 1 00 cross- 
validations, giving the respective confusion matrices for the 
training and hold-out sets. 

The Pearson's correlation coefficient between the intensity of 
some discriminating ions and the declared coffee consumption was 
determined in the SU.VI.MAX2 sub-population of 260 subjects 
selected for the PhenoMeNEp project. 

Identification of discriminant ions 

Ions with ANOVA p-value less than 0.05, VIP greater than 1 
and a mean intensity ratio high consumers/low consumers greater 
than 1 (OSC filtered values) were retained for further investiga- 
tion. The list of discriminants was first grouped into clusters 
according to LC retention time (RT), and correlation between 
fragments was confirmed by a visual check of the corresponding 
extracted ion chromatograms (EIC) of the UPLC-QTof analysis 
and comparison of their distribution across the whole dataset. 

To obtain accurate masses, molecular formulae, and additional 
structural information, a small number of representative samples 
were also analyzed using a high resolution LTQ_ Orbitrap Velos™ 
hybrid mass spectrometer (Thermo Fisher Scientific, San Jose, 
CA) operating in various modes (i.e. full scan and collision induced 
dissociation fragmentation) using mass resolution from 7500 to 
30000. Chromatographic separation was achieved with a RSLC 
Ultimate 3000 liquid chromatography module (Dionex), using the 
same conditions as for the UPLC-QTof-MS analysis. 

The ions of interest were retrieved in the LTQ_-Orbitrap full 
scan chromatograms to obtain accurate masses for these signals, 
which generally allowed the determination of molecular formulae. 
Online databases such as the Human Metabolome Database 
(HMDB; www.hmdb.ca), KNApSAcK (http://kanaya.naist.jp/ 
knapsack_jsp/top.html), Dictionary of Natural Products (http:// 
dnp.chemnetbase.com) and Metlin (http://metlin.scripps.edu) 
were queried to hypothesize identities. In addition, a customized 
in-house database on coffee phytochemical metabolites was used, 
comprising all phytochemicals reported in coffee according to 
Phenol-Explorer (http://www.phenol-explorer.eu), Duke's Phyto- 
chemical and Ethnobotanical Database (http://www.ars-grin. 
gov/duke), the Dictionary of Natural Products, KNApSAcK and 
a literature survey, as well as their known metabolites reported in 
the literature. Additional likely human metabolites of coffee 
constituents were also included, as predicted in silico by the 
software Meteor-Nexus (v. 13.0.0, Lhasa Ltd, Leeds, UK). If a 
hypothesis was within 1 mDa of a chemically and biologically 
plausible coffee-derived metabolite, the commercial standard was 
acquired to perform the final confirmation of identity. If this was 
not available, fragmentation was performed on the LTQ;Orbitrap 
and the resulting spectra examined and compared to those 
available in online databases or literature. In silico prediction of the 
mass fragmentation of the candidate structures was also performed 



using Mass Frontier software (Thermo Fisher Scientific, San 
Jose, CA). The characterization of sulfated and glucuronidated 
derivatives was carried out by comparing profiles obtained before 
and after enzymatic hydrolysis of urine samples. For hydrolysis, 
urine samples (100 (U.1) were incubated overnight in sodium acetate 
buffer at pH 4.9 in the presence of (3-glucuronidase (1000 U) and 
sulfatase (45 U). 

Results 

The 20 high coffee consumers reported a median intake of 
290 mL/d (range: 183-540 mL/d), whereas all low consumers 
reported zero habitual consumption in all questionnaires. No 
significant differences in sex, age, season of urine sampling or BMI 
were observed between the two groups (Chi-squared p-val- 
ue = 0.408, 0.886, 0.069 and 0.869, respectively). 

After data pre-processing, 932 and 179 ions were detected in 
urine analyzed in positive and negative modes, respectively. Data 
were compared using both univariate and multivariate statistics. 
ANOVA with false positive discovery BH correction showed that 
1 1 9 and 1 3 ions in positive and negative modes respectively had 
significandy different intensities (p-value<0.05) in the groups of 
low and high coffee consumers. All significant ions except one {ml z 
258.905) had higher intensities in the group of high consumers 
than in the low, suggesting an exogenous origin for the majority of 
the significant ions. In parallel, the OSC-PLS-DA of the urine 
metabolomic profiles comprising all 1111 detected ions clearly 
distinguished the groups of low and high coffee consumers 
(Figure 1A). The calculated cumulative ratio Q_ 2 (cum) (0.849) 
combined with a low CV- ANOVA (1.77 x 10" 15 ) and a good 
permutation test plot (Figure 1C) indicated an excellent validation 
of the PLS model. This multivariate analysis revealed that 360 ions 
contributed to the discrimination (VIP>1), of which 47 were 
particularly resonant (VIP>2). Use of ANOVA BH p-values and 
the OSC-PLS-DA VIP values notably gave similar ion rankings, 
and all the 132 ions with ANOVA BH p-value<0.05 also had a 
VIP value>1.5. The 132 ions corresponded to around 60 
metabolites detected as clusters of correlated ions comprising 
fragments and adducts and sharing the same retention time. 

The strongest discriminating clusters are listed in Table 1 by 
ascending ANOVA BH p-value of the most significant ion in the 
cluster. Many of these were easily identified as caffeine metabolites 
(Figure 2), based on the comparison of accurate mass, RT and 
fragmentation spectra of unknown features and commercial 
standards. Among these were paraxanthine, a glucuronide of 
either paraxanthine or theophylline, 1-methylxanthine, 1-methy- 
luric acid, 1,7-dimethyluric acid, 1,3 or 3,7 dimethyluric acid, 
1,3,7-trimethyluric acid as well as 5-acetylamino-6-formylamino- 
3-methyluracil (AFMU). Of these, 1-methylxanthine (p-val- 
ue = 8.51 x 10~ 7 , VIP =2.71) contributed most strongly to the 
discrimination and was the third strongest discriminant found in 
the study overall, with a 5-fold greater mean intensity in the high 
consumers than in the low. 1,7-dimethyluric acid was similarly 
elevated in the high consumers (p-value = 8.5 1 x 10" 7 , VIP = 2.72, 
3.3-fold difference in mean intensity between the two groups). 

Also among the most discriminating ions were non-caffeine 
metabolites which represented even more promising candidates 
for biomarkers of coffee consumption (Figure 2). The most 
significant of these was a large cluster eluting at 11.3 min. The 
parent was tentatively identified as atractyligenin glucuronide 
based on the accurate masses of the parent and in-source 
fragments observed in the mass spectra obtained with high 
resolution mass spectrometry analysis of urine samples. The 
identification was further supported by the fragmentation pattern 
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Figure 1. Metabolomic profiling of spot urines from SU.VI.MAX2 subjects. Subjects reported either low or high consumption of coffee, 
represented by squares and circles respectively. A) One-dimensional OSC-PLS-DA score plot of urinary metabolomes of low and high consumers. B) 
Loading plot of the OSC-PLS-DA. Circled outlying ions contribute most strongly to the discrimination. C) Model validation assessed by permutation 
test (n = 100). 
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of ions of interest in hydrolyzed and non-hydrolyzed urine samples 
(Supporting Information S2). 

The ion m/z1\ 1.146, representing the second most significant 
metabolite, was tentatively identified as an isomer of the 
diketopiperazine cyclo(leucyl-prolyl), most likely the cyclo(isoleu- 
cyl-prolyl). The fragmentation spectrum was very similar to that of 
the cyclo(leucyl-prolyl) standard spiked in blank urine but with a 
slightly shorter retention time (Supporting Information S3). Very 
similar fragmentation patterns have previously been reported for 
the two compounds extracted from roasted coffee [24]. Another 
discriminant strongly correlated with coffee intake was the alkaloid 
trigonelline (m/z 138.055), which was identified by comparison of 
exact mass and RT with the authentic standard. Two other 
discriminants eluting at 9.7 (C^H^On) and 11.4min 
(C 2 6H 32 0|o) were found to be glucuronide conjugates with very 
similar fragmentation spectra. The most plausible hypothesis for 
the cluster eluting at 11.4 min was a glucuronide of a kahweol 
metabolite produced by oxidation of a primary alcohol. Oxidation 
was the most likely biotransformation predicted by Meteor 
software for kahweol. The cluster eluting at 9.7 min might be a 
hydrated analogue that lost a water moiety during ionisation. 
Hippuric acid and its 3-hydroxy derivative were also correlated 
with coffee consumption. 

The ratio of mean intensity in high consumers/ mean intensity 
in low consumers (Table 1) varied from 1.4 for 3-hydroxyhippuric 
acid to 17.5 for atractyligenin glucuronide, with values around 4—5 



for cyclo(isoleucyl-prolyl), kahweol oxide glucuronide, dimethyl- 
xanthine glucuronide and trigonelline. For some compounds (i.e. 
trigonelline, 1,7 dimethyluric acid, paraxanthine) the mean 
intensity in non-consumers was relatively high, suggesting other 
possible origins than coffee. 

The performance of the candidate biomarkers was assessed 
using ROC curves [23]. Table 1 shows the AUC and 95% CI 
obtained for the most important discriminants. According to the 
accepted classification of biomarker utility, candidate markers of 
AUO0.9 are considered "excellent", which was the case for 12 
of the 33 most discriminating markers (Table 1). Sixteen others 
were classified as "good" (0.8-0.9), 3 as "fair" (0.7-0.8) and 2 as 
"poor" (0.6-0.7). It is more accurate to consider the 95% CI, 
which gives a spread of possible values. ATMU, for example 
(AUC = 0.83), could be classified as a "good" biomarker. 
However, the lower 95% CI limit is 0.594. Lower 95% CI limits 
greater than 0.8 were observed for six discriminants only: 
atractyligenin glucuronide, cyclo(isoleucyl-prolyl), 1-methylxan- 
thine, 1 , 7-dimethyluric acid, kawheol oxide glucuronide and an 
unidentified discriminant (unknownl; p200.108). The perfor- 
mance of atractyligenin glucuronide and caffeine is illustrated in 
Figure 3. The results obtained with the validation population (10 
low and 10 high coffee consumers) confirmed that atractyligenin 
glucuronide would be a much more effective marker of coffee 
consumption than caffeine (AUC 0.95 vs 0.72; Figure 3A). 
Furthermore, the results of the permutation test (n = 500) showed 
that the model based on atractyligenin glucuronide is significant 
(p-value<0.002), whereas the one based on caffeine is not (p- 
value = 0.062). The caffeine based model was able to correctly 
classify most high coffee consumers (Figure 3B), but misclassified 
half of the low-consumers, either in the training set or hold-out set. 
Finally, the predictive model of caffeine is sufficiently sensitive but 
not specific (p = 0.04), unlike the atractyligenin glucuronide model 
(p = 0.0006). 

Since atractyligenin glucuronide is not commercially available, 
we tested the performance of a combination of cyclo(isoleucyl- 
prolyl), 1 -methylxanthine and trigonelline, the three best candi- 
date biomarkers commercially available. ROC curve analysis 
showed that this combination would be a more effective biomarker 
(better AUC and 95% CI) than any of the compounds alone 
(Figure 4). 

For the three new biomarkers revealed in the study, namely 
atractyligenin glucuronide, cyclo(isoleucyl-prolyl) and kahweol 
oxide glucuronide, Pearson's correlation between their intensity 
in urine and the declared coffee intake of the subjects was analyzed 
in a SU.VI.MAX2 group of 260 subjects and showed significant 
positive correlations (p<0.0001, R = 0.534, 0.543 and 0.561, 
respectively; Supporting Information S4). Pearson's correlation 
coefficient was also high for 1 -methylxanthine (0.508) and 
trigonelline (0.467) but low for caffeine (0.257), hippuric acid 
(0.199), and 3-hydroxyhippuric acid (0.214). 

Discussion 

Urine metabolomic profiles of well-characterized groups of high 
and low coffee consumers from the SU.VI.MAX2 cohort were 
easily distinguished (Figure 1). Sensitive and specific biomarkers of 
coffee intake could then be searched for among the discriminant 
features. Some of these features corresponded to previously 
reported markers of intake. Caffeine, first proposed as a marker 
of intake some decades ago [25], was indeed a discriminant of 
coffee intake in the SU.VI.MAX2 cohort, although a relatively 
weak one (p-value = 0.0266; VIP = 1 .65), and a moderately intense 
signal was found even in the urine of the low consumers. Some of 
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Figure 3. ROC curve analysis of atractyligenin glucuronide and caffeine. Data for atractyligenin glucuronide are presented in the left-hand 
column and data for caffeine in the right-hand column. A) Blue curves represent the training set (n = 39 subjects) and pink curves the hold-out set 
(n = 20 subjects). B) Probabilities of predicted belonging to the high consumer class. Training set, black plots; hold-out set, red plots; filled circles, high 
consumers; empty circles, low consumers. C) Confusion matrices for the two datasets. 
doi:1 0.1 371 /journal.pone.0093474.g003 



its metabolites were better discriminants, although caffeine is 
found in tea, cola, energy drinks and supplements, compromising 
specificity for coffee intake. Also, caffeine metabolism is known to 
be affected by various factors, including genetic variation in the 
CYP1A2 gene, and caffeine clearance can vary to up to 40-fold 
between individuals [26]. The resulting inter-individual variation 
in urinary caffeine metabolites is not desirable for biomarkers of 
intake. 

Trigonelline, another alkaloid, has previously been reported as a 
marker of coffee consumption in intervention participants who 
had consumed acute doses of coffee [27]. Trigonelline was the 7 th 
strongest discriminant between the groups in our study (p- 
value = 8.68xl0" 5 , VIP = 2.57), and the 2 nd most intense 
discriminant in urine of high coffee consumers. However, it is 



also found in alfalfa sprouts, lentils, chickpeas, oats and fenugreek 
[28,29] and is a plasma and urinary metabolite of niacin (vitamin 
B3). The trigonelline metabolite N-methylpyridinium, also report- 
ed as a marker of coffee intake [27], was not among the 
discriminants in the present study. Masses corresponding to 
cafestol and kahweol, the well-known coffee diterpenoids, were not 
observed among the discriminants, although a feature at [M+H] + 
329 was tentatively identified as a kahweol oxide glucuronide. The 
metabolism of cafestol and kahweol in humans is poorly 
documented. An intervention study in ileostomy volunteers 
showed that both are well absorbed in the small intestine but 
litde is excreted as conjugates of glucuronic acid or sulfate in urine, 
suggesting other metabolic routes such as oxidation [30]. 
Oxidation of the primary alcohol of kahweol with subsequent 
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Figure 4. ROC curve AUCs for single and combination 
biomarkers. Error bars represent 95% confidence intervals. cIP, 
cyclo(isoleucyl-prolyl); MX, 1-methylxanthine; Tr, trigonelline; Atr, 
atractyligenin glucuronides; Caf, caffeine. 
doi:1 0.1 371 /journal.pone.0093474.g004 

glucuronidafion of this alcohol was the first biotransformation 
predicted by Meteor software, supporting our tentative identifica- 
tion. If confirmed, kahweol oxide glucuronide may deserve further 
qualification as a new biomarker of coffee intake since the 
beverage is the only known dietary source of kahweol. It may, 
however, reflect only the consumption of unfiltered coffee such as 
expresso or Scandinavian-type boiled coffee, as brew preparations 
with paper filter have been shown to trap most of cafestol and 
kahweol in the filter [3 1] . 

Coffee is by far the greatest dietary source of hydroxycinnamic 
acids in human diets, and various chlorogenic acid isomers are 
absorbed and excreted in humans after coffee consumption 
[32,33]. Previous intervention studies on polyphenols have 
suggested that urinary chlorogenic acids could be specific 
biomarkers of coffee consumption [34,35] and the non-targeted 
metabolomic profiling of intervention subjects who had consumed 
acute doses of coffee [36] identified specific hydroxycinnamates as 
potential markers, of which dihydrocaffeic acid 3-sulfate and 
feruloylglycine might be the most promising due to their relatively 
long Tmax values (>4 h). Hydroxycinnamates and other phenols 
were detected in our study but did not contribute to the 
discrimination between high and low consumers. One explanation 
might be insufficient specificity for coffee consumption, since they 
are also widely consumed from fruits. Hydroxycinnamate metab- 
olites could be useful as compliance biomarkers in controlled 
intervention studies but of limited use in cohort studies where 
subjects freely consume a variety of plant foods, and especially 
when only spot urines are available, since compounds with short 
half-life are probably not recovered in these samples. Hippuric 
acid and 3-hydroxyhippuric acid are end-products of the microbial 
catabolism of numerous polyphenols and aromatic amino acids. 
They have also been reported as discriminant for many 
physiopathological conditions or for exposures to chemical toxins 
in metabolomics studies [37]. Despite coffee being the richest 
dietary source of chlorogenic acids which are degraded to hippuric 
and hydroxyhippuric acids, they cannot be considered reliable 
biomarkers of coffee intake. 

Beyond the known coffee phytochemical metabolites described 
above, our data-driven approach revealed some novel candidate 
biomarkers with high specificity and sensitivity. Atractyligenin 
glucuronide was the strongest discriminant in the study (p- 



value = 7.47 xl0~ 9 , VIP = 2.98), and also demonstrated the 
greatest mean difference in intensity between high and low- 
consumers (17-fold). Atractyligenin is a diterpenoid whose 
glycoside, atractyloside, and other derivatives are present in green 
and roasted Coffea arabica beans in concentrations as high as 
624 mg/kg [38,39]. Atractyloside derivatives are also present in 
many plants used in ethnomedicines, but have not been reported 
in any other human foodstuff. Atractyloside is well known as an 
exceptionally specific and effective inhibitor of the ADP/ATP 
transport in mitochondria, able to block oxidative phosphorylation 
(34). Atractyligenin glucuronide has been identified in the urine of 
habitual coffee consumers at 3 ug/mL [40], but is proposed for 
the first time here as a biomarker of coffee intake. Its potential 
contribution to coffee health effects may also deserve further 
attention. The diketopiperazine cyclo(isoleucyl-prolyl) was the 
second most discriminating ion in the study (p-value =1.61x10 , 
VIP = 2.81) and thus another potential biomarker of consumption. 
It is known to be a bitter constituent of coffee [24] . Diketopiper- 
azines have not previously been proposed as biomarkers of 
consumption, although urinary cyclo(pro-pro) and cyclo(ser-tyr) 
have been found to correlate with high cocoa consumption [41]. 

Metabolomics is essentially an exploratory approach, with some 
limitations. No method of profiling can cover the full chemical 
space of the food metabolome, and identification of unknowns is 
challenging [42]. Variation in study design and data processing 
may also lead to the discovery of different biomarkers. However, 
when the objective is the discovery of a biomarker of food intake, 
and not the comprehensive characterization of nutritional 
exposure following intake, it is not necessary to identify all 
discriminants associated with the consumption of the food, but one 
or a few promising candidate biomarkers only. In the present 
study, ROC curve analyses demonstrated that when an effective 
single biomarker is not found or not available as standard, a 
combination of several discriminants can provide a sensitive and 
specific biomarker. 

The present work also showed that metabolomic profiling of 
urine samples can be applied to cohort study subjects to efficiently 
discover biomarkers of food consumption. A clear discrimination 
was achieved with relatively few subjects, but more may be 
required for other foods depending on their composition and 
patterns of consumption. The use of cohort samples can reveal 
more robust biomarkers than intervention studies, since they do 
not rely on a prescribed timeframe of sampling or exaggerated 
doses of the food of interest. Here, the absence of coffee 
hydroxycinnamate metabolites as discriminants support previous 
findings that many candidate biomarkers discovered in interven- 
tion studies may not be specific or robust enough for use in cross- 
sectional studies [16]. However, care must be taken when using 
cross-sectional studies for biomarker discovery since intake of other 
foods may correlate with that of the foods of interest. All possible 
dietary and metabolic origins must be carefully checked for any 
proposed biomarker. 

Biomarker validation is a laborious process requiring dose- 
response and pharmacokinetics studies, attention to specificity and 
association with intake in various populations with different ethnic 
and dietary backgrounds, and consideration of the main factors 
affecting the relationship between the biomarker concentration 
and the quantity of food consumed. Our three new biomarkers 
were closely correlated with reported coffee intake for 260 subjects 
from the same SU.VI.MAX2 cohort, despite the semi-quantitative 
measurement of biomarker intensity. The correlations could be 
even more accurate if intake data were collected at the same time 
as urine samples in which the biomarker is quantified. 
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Coffee is an ideal foodstuff for which to search for biomarkers of 
consumption. It is rich in highly specific bioactives and its 
consumption is frequent in high consumers, meaning that urine 
concentrations of bioactives are often high. The challenge that lies 
ahead will be using metabolomics profiling of cohort study samples 
to identify biomarkers for the intake of many foods less rich in 
specific micro-constituents and consumed as part of complex 
dietary patterns. 
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